Bilingual Spoken Monologue Corpus for Simultaneous Machine Interpretation Research
Abstract
This paper describes a large-scale bilingual corpus of spoken monologues and their simultaneous interpretation, constructed at CIAIR. The corpus has the following characteristics: (1) English and Japanese speech is recorded in parallel, (2) the data contains monologue speech such as lectures and self-introductions, and (3) the exact beginning and ending times are provided for each utterance. We have collected about 70 hours of speech data in total and transcribed it into ASCII text files. The corpus will be made publicly available in the near future. This paper also provides an analysis of professional interpreters' speech using the bilingual corpus. The following points have been investigated: (1) the interpreting unit of simultaneous interpretation, (2) the difference between the beginning time of the lecturer's utterance and that of the interpreter's utterance, and (3) the interpreter's speaking speed. The characteristic features of the timing at which simultaneous interpreters start to speak are presented. The analysis can be used in the development of a simultaneous machine interpreting system.
Similar resources
Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus
This paper investigates simultaneous interpreting patterns using a bilingual spoken monologue corpus. A total of 4,578 pairs of aligned English-Japanese utterances from the CIAIR simultaneous interpretation database were used. This is the largest-scale observational investigation of simultaneous interpreting speech. The simultaneous interpreters are required to generate the target speech si...
Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation
With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there are a few studies on simultaneous interpreting, and recently, the language resources for promoting simultaneous interpreting r...
Interpreting Unit Segmentation of Conversational Speech in Simultaneous Interpretation Corpus
Speech-to-speech translation is becoming an important research topic as speech and language processing technology progresses. Considering the efficiency and smoothness of cross-lingual conversation, the simultaneity of the translation processing has a great influence on the performance of the system. This paper describes interpreting unit segmentation of conversational...
Incremental dependency parsing of Japanese spoken monologue based on clause boundaries
In applications of spoken monologue processing such as simultaneous machine interpretation and real-time caption generation, incremental language parsing is strongly required. This paper proposes a technique for incremental dependency parsing of Japanese spoken monologue on a clause-by-clause basis. The technique identifies the clauses based on clause boundary analysis, analyzes the dependen...
Spoken language corpus for machine interpretation research
This paper describes a speech and language database that we are currently constructing for research on machine interpretation. The database contains bilingual data of lectures and dialogues. We have collected about 72 hours of speech in total and transcribed it into text manually. We have investigated the database in order to acquire empirical knowledg...